
Applications of Differentiation

Tracking the Sign of the Derivative

If $f$ is a function, then the sign of its derivative, $f'$, indicates whether $f$ is increasing ($f'>0$), decreasing ($f'<0$), or stationary ($f'=0$). If the derivative takes the value 0 at a point $x_0$, then the function has a maximum, a minimum, or a saddle point at $x_0$.

Details

If $f$ is a function, then the sign of its derivative, $f'$, indicates whether $f$ is increasing ($f'>0$), decreasing ($f'<0$), or stationary ($f'=0$). $f'$ can be zero at points where $f$ has a maximum, a minimum, or a saddle point.

If $f'(x)>0$ for $x < x_0$, $f'(x_0)=0$ and $f'(x)<0$ for $x > x_0$, then $f$ has a maximum at $x_0$.

If $f'(x)<0$ for $x < x_0$, $f'(x_0)=0$ and $f'(x)>0$ for $x > x_0$, then $f$ has a minimum at $x_0$.

If $f'(x)>0$ for $x < x_0$, $f'(x_0)=0$ and $f'(x)>0$ for $x > x_0$, then $f$ has a saddle point at $x_0$.

If $f'(x)<0$ for $x < x_0$, $f'(x_0)=0$ and $f'(x)<0$ for $x > x_0$, then $f$ has a saddle point at $x_0$.

Examples

Example

If $f$ is a function such that its derivative is given by

$$f'(x) = (x-1)(x-2)(x-3)(x-4),$$

then applying the above criteria for maxima and minima, we see that $f$ has maxima at $1$ and $3$ and minima at $2$ and $4$.
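
To see this concretely, here is a small Python sketch (not part of the original notes) that classifies each root of $f'$ by checking the sign of $f'$ just below and just above it; the offset `eps` is an arbitrary choice:

```python
# Classify the critical points of f'(x) = (x-1)(x-2)(x-3)(x-4)
# by checking the sign of f' just below and just above each root.

def f_prime(x):
    return (x - 1) * (x - 2) * (x - 3) * (x - 4)

eps = 1e-3  # small offset around each critical point
for x0 in [1, 2, 3, 4]:
    left, right = f_prime(x0 - eps), f_prime(x0 + eps)
    if left > 0 and right < 0:
        kind = "maximum"
    elif left < 0 and right > 0:
        kind = "minimum"
    else:
        kind = "saddle point"
    print(f"x0 = {x0}: {kind}")
# Prints maxima at 1 and 3, minima at 2 and 4.
```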

Describing Extrema Using $f''$

$x_0$ with $f'(x_0)=0$ corresponds to a maximum if $f''(x_0)<0$.

$x_0$ with $f'(x_0)=0$ corresponds to a minimum if $f''(x_0)>0$.

Details

If $f'(x_0)=0$ corresponds to a maximum, then the derivative is decreasing and the second derivative cannot be positive, i.e. $f''(x_0)\leq 0$. In particular, if the second derivative is strictly negative ($f''(x_0)<0$), then we are assured that the point is indeed a maximum, and not a saddle point.

If $f'(x_0)=0$ corresponds to a minimum, then the derivative is increasing and the second derivative cannot be negative, i.e. $f''(x_0)\geq 0$.

If the second derivative is zero, then the point may be a saddle point, as happens with $f(x)=x^3$ at $x=0$.
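
As an illustration of the second-derivative test (our own sketch, using the sympy library), the following code evaluates $f''$ at each critical point of a few simple functions, including $f(x)=x^3$ where the test is inconclusive:

```python
# Second derivative test on a few simple functions (illustrative sketch).
import sympy as sp

x = sp.symbols('x')

for f in [x**2, -x**2, x**3]:
    f1, f2 = sp.diff(f, x), sp.diff(f, x, 2)
    crit = sp.solve(f1, x)                 # points where f'(x) = 0
    for x0 in crit:
        curvature = float(f2.subs(x, x0))  # f''(x0)
        if curvature > 0:
            verdict = "minimum"
        elif curvature < 0:
            verdict = "maximum"
        else:
            verdict = "test inconclusive (possible saddle point)"
        print(f"f(x) = {f}, x0 = {x0}: f''(x0) = {curvature} -> {verdict}")
```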

The Likelihood Function

If $p$ is the probability mass function (p.m.f.):

$$p(x) = P[X = x]$$

then the joint probability of obtaining a sequence of outcomes from independent sampling is:

$$p(x_1) \cdot p(x_2) \cdot p(x_3) \cdots p(x_n)$$

Suppose each probability includes some parameter $\theta$; this is written:

$$p_{\theta}(x_1), \ldots, p_{\theta}(x_n)$$

If the experiment gives $x_1, x_2, \ldots, x_n$, we can write the probability as a function of the parameters:

$$L_{\mathbf{x}}(\theta) = p_{\theta}(x_1) \cdots p_{\theta}(x_n)$$

This is the likelihood function.

Details

Definition

Recall that the probability mass function (p.m.f.) is a function giving the probability of outcomes of an experiment.

We typically denote the p.m.f. by $p$ so $p(x)$ gives the probability of a given outcome, $x$, of an experiment. The p.m.f. commonly depends on some parameter. We often write

$$p(x) = P[X = x]$$

If we take a sample of independent measurements from $p$, then the joint probability of a given set of numbers is:

$$p(x_1) \cdot p(x_2) \cdot p(x_3) \cdots p(x_n)$$

Suppose each probability includes the same parameter $\theta$; then this is typically written:

$$p_{\theta}(x_1), \ldots, p_{\theta}(x_n)$$

Now consider the set of outcomes $x_1, x_2, \ldots, x_n$ from the experiment. We can then regard the probability of this outcome as a function of the parameters.

Definition

$$L_{\mathbf{x}}(\theta) = p_{\theta}(x_1) \cdots p_{\theta}(x_n)$$

This is the likelihood function and we often seek to maximize it to estimate the unknown parameters.
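
As a minimal sketch of this definition (the Bernoulli p.m.f., the data and the $\theta$ values below are our own choices, not from the text), a likelihood can be computed by multiplying p.m.f. values over the observed sample:

```python
# A likelihood is the product of p.m.f. values over the observed sample.
# Minimal sketch with a Bernoulli p.m.f.; data and theta values are made up.
from math import prod

def bernoulli_pmf(x, theta):
    """p_theta(x) = theta if x == 1, (1 - theta) if x == 0."""
    return theta if x == 1 else 1 - theta

def likelihood(theta, data, pmf=bernoulli_pmf):
    return prod(pmf(x, theta) for x in data)

data = [1, 0, 1, 1, 0, 1]          # hypothetical observed sample
for theta in (0.3, 0.5, 0.7):
    print(theta, likelihood(theta, data))
# theta = 0.7 gives the largest of the three values, close to x/n = 4/6.
```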

Examples

Example

Suppose we toss a biased coin $n$ independent times and obtain $x$ heads. We know the probability of obtaining $x$ heads is:

$$\binom{n}{x} p^x (1-p)^{n-x}$$

The parameter of interest is $p$ and the likelihood function is:

$$L(p) = \binom{n}{x} p^x (1-p)^{n-x}$$

If $p$ is unknown we sometimes wish to maximize this function with respect to $p$ in order to estimate the true probability $p$.

Plotting the Likelihood

missing slide -- want to give a numeric example and plot $L$

Examples

missing example -- want to give a numeric example and plot $L$
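
In the absence of the original slide, here is one possible numeric example (the values $n=10$ and $x=7$ are our own choice) that plots the binomial likelihood $L(p)$ from the coin-toss example:

```python
# Placeholder for the missing slide: plot the binomial likelihood numerically.
# The numbers n = 10 and x = 7 are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt
from math import comb

n, x = 10, 7
p = np.linspace(0.01, 0.99, 200)
L = comb(n, x) * p**x * (1 - p)**(n - x)

plt.plot(p, L)
plt.axvline(x / n, linestyle="--", label="x/n = 0.7")
plt.xlabel("p")
plt.ylabel("L(p)")
plt.title("Binomial likelihood, n = 10, x = 7")
plt.legend()
plt.show()
```

The plotted likelihood peaks at $p = x/n = 0.7$, the value derived analytically in the next section.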

Maximum Likelihood Estimation

If $L$ is a likelihood function for a p.m.f. $p_{\theta}$, then the value $\hat{\theta}$ which gives the maximum of $L$:

$$L(\hat{\theta}) = \max_{\theta} L(\theta)$$

is the maximum likelihood estimator (MLE) of $\theta$.

Details

Definition

If $L$ is a likelihood function for a p.m.f. $p_{\theta}$, then the value $\hat{\theta}$ which gives the maximum of $L$:

$$L(\hat{\theta}) = \max_{\theta} L(\theta)$$

is the maximum likelihood estimator of $\theta$.

Examples

Example

If $x$ is the number of heads from $n$ independent tosses of a coin, the likelihood function is:

$$L_x(p) = \binom{n}{x} p^x (1-p)^{n-x}$$

Maximizing this is equivalent to maximizing the logarithm of the likelihood, since logarithmic functions are increasing. The log-likelihood can be written as:

$$\ell(p) = \ln(L(p)) = \ln \binom{n}{x} + x \ln(p) + (n-x) \ln(1-p)$$

To find possible maxima, we need to differentiate this formula and set the derivative to zero:

$$0 = \frac{d\ell}{dp} = 0 + \frac{x}{p} + \frac{n-x}{1-p}\cdot(-1)$$

$$0 = p(1-p)\frac{x}{p} - p(1-p)\frac{n-x}{1-p}$$

$$0 = (1-p)x - p(n-x)$$

$$0 = x - px - pn + px = x - pn$$

So:

$$0 = x - pn$$

$$p = \frac{x}{n}$$

is the extremum; since $\ell''(p) = -\frac{x}{p^2} - \frac{n-x}{(1-p)^2} < 0$ it is a maximum, and so we can write:

$$\hat{p} = \frac{x}{n}$$

for the MLE.
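
As a numerical check (with made-up values of $n$ and $x$, and using scipy's bounded scalar minimizer), one can minimize the negative log-likelihood and compare the result with $x/n$:

```python
# Numerical check of the MLE p_hat = x/n by minimizing the negative log-likelihood.
# The values of n and x are made up for illustration.
from math import comb, log
from scipy.optimize import minimize_scalar

n, x = 100, 37

def neg_log_likelihood(p):
    return -(log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x / n)   # both are approximately 0.37
```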

Least Squares Estimation

Least squares: Estimate the parameters $\theta$ by minimizing:

$$\sum_{i=1}^{n} (y_i - g_i(\theta))^2$$

Details

Suppose we have a model linking data to parameters. In general we are predicting $y_i$ as $g_i(\theta)$.

In this case it makes sense to estimate the parameters $\theta$ by minimizing:

$$\sum_{i=1}^{n} (y_i - g_i(\theta))^2$$

Examples

Example

One may predict numbers, $x_i$, as a mean, $\mu$, plus error. Consider the simple model $x_i = \mu + \epsilon_i$, where $\mu$ is an unknown parameter (constant) and $\epsilon_i$ is the error in measurement when obtaining the $i^{th}$ observation, $x_i$, $i=1,\ldots,n$.

A natural method to estimate the parameter is to minimize the squared deviations:

$$\min_{\mu} \sum_{i=1}^n \left( x_i - \mu \right)^2$$

It is not hard to see that the $\hat{\mu}$ that minimizes this is the mean:

$$\hat{\mu} = \bar{x}$$
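
A quick numerical illustration of this fact (the data below are made up) is to evaluate the sum of squared deviations on a grid of candidate values of $\mu$ and note that the minimizer agrees with the sample mean:

```python
# Check numerically that the sample mean minimizes sum((x_i - mu)^2).
# The data values are made up for illustration.
import numpy as np

x = np.array([3.1, 4.5, 2.2, 5.0, 3.8])
mu_grid = np.linspace(x.min(), x.max(), 1001)
sse = [np.sum((x - mu)**2) for mu in mu_grid]

print(mu_grid[np.argmin(sse)])   # approximately equal to ...
print(x.mean())                  # ... the sample mean
```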

Example

One also commonly predicts data $y_1, \ldots, y_n$ with values on a straight line, i.e. with $\alpha + \beta x_i$, where $x_1, \ldots, x_n$ are fixed numbers. This leads to the regression problem of finding parameter estimates $\hat{\alpha}$ and $\hat{\beta}$ which give the best-fitting straight line in terms of least squares:

$$\min_{\alpha,\beta} \sum_{i=1}^{n} \left( y_i - (\alpha + \beta x_i) \right)^2$$
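
One possible sketch of this fit (with made-up data) uses the standard closed-form least-squares estimates and compares them with numpy's polynomial fit:

```python
# Least-squares straight line: closed-form estimates compared with np.polyfit.
# The data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)

slope, intercept = np.polyfit(x, y, 1)   # same fit via numpy
print(intercept, slope)
```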

Example

As a general exercise in finding the extremum of a function, let's look at the function $f(\theta)=\sum_{i=1}^n (x_i\theta - 3)^2$, where the $x_i$ are some constants. We wish to find the $\theta$ that minimizes this sum. We simply differentiate with respect to $\theta$ to obtain:

$$f'(\theta) = \sum_{i=1}^n 2(x_i\theta - 3)x_i = 2\sum_{i=1}^n x_i^2\theta - 2\sum_{i=1}^n 3x_i$$

Thus:

$$\begin{aligned} f'(\theta) &= 2\theta \sum_{i=1}^n x_i^2 - 2\sum_{i=1}^n 3x_i = 0 \\ &\Leftrightarrow \theta = \frac{\sum_{i=1}^n 3x_i}{\sum_{i=1}^n x_i^2} \end{aligned}$$
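
As a quick sanity check (with made-up $x_i$ values), the candidate minimizer $\theta = \sum 3x_i / \sum x_i^2$ can be compared numerically with nearby values of $\theta$:

```python
# Numerical check of theta = 3 * sum(x_i) / sum(x_i^2) for f(theta) = sum((x_i*theta - 3)^2).
# The x_i values are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 0.5])
theta_star = 3 * x.sum() / np.sum(x**2)

def f(theta):
    return np.sum((x * theta - 3)**2)

# f at the candidate minimizer is no larger than at nearby points
print(theta_star, f(theta_star), f(theta_star - 0.1), f(theta_star + 0.1))
```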